Off-Line Dictionary-Based Compression
Abstract
The dictionary-based compression methods described in Chapter 3 of the book differ from one another but have one thing in common: they generate the dictionary as they go along, reading data and compressing it. The dictionary is not included in the compressed file; the decoder generates it in lockstep with the encoder. Such methods can therefore be termed "online." In contrast, the methods described here are also dictionary based but can be considered "offline" because they include the dictionary in the compressed file.
The first method is byte pair encoding (BPE). This is a simple compression method, due to [Gage 94], that often features only mediocre performance. It is described here because (1) it is an example of a multipass method (two-pass compression algorithms are common, but methods with more passes are normally considered too slow) and (2) it eliminates only certain types of redundancy and should therefore be applied only to data files that feature this redundancy. (The second method, by [Larsson and Moffat 00], does not suffer from these restrictions and is much more efficient.) BPE is both an example of an offline dictionary-based compression algorithm and a simple example (perhaps the simplest) of a grammar-based compression method. In addition, the BPE decoder is very small, which makes it an ideal candidate for applications where memory size is restricted.
The BPE method is easy to understand. We assume that the data symbols are bytes and we use the term bigram for a pair of consecutive bytes. Each pass locates the most-common bigram and replaces it with an unused byte value. Thus, the method performs best on files that have many unused byte values, and one aim of this document is to show what types of data feature this kind of redundancy. First, however, a small example.
Given the character set A, B, C, D, X, and Y and the data file ABABCABCD (where X and Y are unused bytes), the first pass identifies the pair AB as the most-common bigram and replaces each of its three occurrences with the single byte X. The result is XXCXCD. The second pass identifies the pair XC as the most-common bigram and replaces each of its two occurrences with the single byte Y. The result is XYYD, where every bigram occurs just once. Bigrams that occur just once can also be replaced, if more unused byte values are available. However, each …
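The passes in the example above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [Gage 94]. It works on text strings instead of raw bytes, the function names `bpe_pass`, `bpe_compress`, and `bpe_expand` are hypothetical, and it makes the simplifying assumption that compression stops once no bigram occurs at least twice.

```python
from collections import Counter


def bpe_pass(data, unused):
    """Run one pass: replace the most-common bigram in `data` with `unused`.

    Returns (new_data, bigram), or (data, None) when no bigram occurs at
    least twice (a stopping rule assumed for this sketch).
    """
    # Count every pair of adjacent symbols. Overlapping pairs (as in "AAA")
    # are counted individually, so a count can exceed the number of
    # replacements actually made; decoding is unaffected.
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    if not counts:
        return data, None
    bigram, count = counts.most_common(1)[0]
    if count < 2:
        return data, None
    return data.replace(bigram, unused), bigram


def bpe_compress(data, unused_symbols):
    """Apply passes until the unused symbols run out or nothing repeats."""
    dictionary = []  # (replacement symbol, bigram) pairs, in creation order
    for symbol in unused_symbols:
        data, bigram = bpe_pass(data, symbol)
        if bigram is None:
            break
        dictionary.append((symbol, bigram))
    return data, dictionary


def bpe_expand(data, dictionary):
    """Undo the replacements, most recent first."""
    for symbol, bigram in reversed(dictionary):
        data = data.replace(symbol, bigram)
    return data


compressed, dictionary = bpe_compress("ABABCABCD", "XY")
print(compressed)                          # XYYD
print(dictionary)                          # [('X', 'AB'), ('Y', 'XC')]
print(bpe_expand(compressed, dictionary))  # ABABCABCD
```

The dictionary returned by `bpe_compress` is what an offline method would store alongside the compressed data, and `bpe_expand` shows why the decoder can be so small: it only has to substitute bigrams back in reverse order.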
Similar resources
Data Compression Using a Dictionary of Patterns
Most modern lossless data compression techniques used today, are based in dictionaries. If some string of data being compressed matches a portion previously seen, then such string is included in the dictionary and its reference is included every time it occurs. A possible generalization of this scheme is to consider not only strings made of consecutive symbols, but more general patterns with ga...
DNA Sequence Compression Using the Burrows-Wheeler Transform
We investigate off-line dictionary oriented approaches to DNA sequence compression, based on the Burrows-Wheeler Transform (BWT). The preponderance of short repeating patterns is an important phenomenon in biological sequences. Here, we propose off-line methods to compress DNA sequences that exploit the different repetition structures inherent in such sequences. Repetition analysis is performed...
JBIG2 Symbol Dictionary Design Based on Minimum Spanning Trees
The JBIG2 standard is a very flexible bi-level image coding strategy based on pattern matching. The encoder collects a set of symbols in a dictionary and encodes a page by reference to the dictionary symbols. JBIG2 allows the encoder to view all symbols and choose a good set for the dictionary. In this paper, we examine the bit rate trade-off that arises in choosing different dictionary sizes. ...
Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches
With the widening gap between processor and memory speeds, memory system designers may find cache compression beneficial to increase cache capacity and reduce off-chip bandwidth. Most hardware compression algorithms fall into the dictionary-based category, which depend on building a dictionary and using its entries to encode repeated data values. Such algorithms are effective in compressing lar...
Dictionary design for text image compression with JBIG2
The JBIG2 standard for lossy and lossless bi-level image coding is a very flexible encoding strategy based on pattern matching techniques. This paper addresses the problem of compressing text images with JBIG2. For text image compression, JBIG2 allows two encoding strategies: SPM and PM&S. We compare in detail the lossless and lossy coding performance using the SPM-based and PM&S-based JBIG2, i...
Publication date: 2007